Disentangled Motif-aware Graph Learning for Phrase Grounding
Abstract
In this paper, we propose a novel graph learning framework for phrase grounding in the image. Developing from the sequential to the dense graph model, existing works capture coarse-grained context but fail to distinguish the diversity of context among phrases and image regions. In contrast, we pay special attention to the different motifs implied in the context of the scene graph and devise a disentangled graph network to integrate motif-aware contextual information into the representations. Besides, we adopt interventional strategies at the feature and structure levels to consolidate and generalize the representations. Finally, a cross-modal attention network is used to fuse the intra-modal features, and the similarity between each phrase and every image region is computed to select the best-grounded region. We validate the effectiveness of the disentangled and interventional graph network (DIGN) through a series of ablation studies, and our model achieves state-of-the-art performance on the Flickr30K Entities and ReferIt Game benchmarks.
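The final grounding step in the abstract has two parts. Features are first fused with cross-modal attention. Each phrase is then scored against every region, and the best match is kept. Both parts can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DIGN's actual implementation. The residual fusion in `attend`, the use of cosine similarity as the score, and all function names are hypothetical. The toy 2-D vectors stand in for learned phrase and region features.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    # Cosine similarity; a tiny epsilon guards against zero-length vectors.
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12
    return dot(a, b) / denom


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(phrase: Vector, regions: List[Vector]) -> List[float]:
    """Hypothetical cross-modal fusion: scaled dot-product attention of one
    phrase over all regions, added back to the phrase as a residual."""
    d = len(phrase)
    weights = softmax([dot(phrase, r) / math.sqrt(d) for r in regions])
    context = [sum(w * r[k] for w, r in zip(weights, regions)) for k in range(d)]
    return [p + c for p, c in zip(phrase, context)]


def ground(phrases: List[Vector], regions: List[Vector]) -> List[int]:
    """For each phrase, return the index of the most similar region."""
    picks = []
    for phrase in phrases:
        fused = attend(phrase, regions)
        scores = [cosine(fused, r) for r in regions]
        # Keep the region with the highest similarity (argmax).
        picks.append(max(range(len(regions)), key=scores.__getitem__))
    return picks


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    phrases = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
    print(ground(phrases, regions))  # [0, 1, 2]
```

Each phrase picks exactly one region. This matches the abstract's "select the best-grounded one". A real model would instead learn the fusion and the similarity from the graph-refined features.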
Similar sources
Scalable Motif-aware Graph Clustering
We develop new methods based on graph motifs for graph clustering, allowing more efficient detection of communities within networks. We focus on triangles within graphs, but our techniques extend to other clique motifs as well. Our intuition, which has been suggested but not formalized similarly in previous works, is that triangles are a better signature of community than edges. We therefore ge...
Linear Disentangled Representation Learning for Facial Actions
Limited annotated data available for the recognition of facial expression and action units embarrasses the training of deep networks, which can learn disentangled invariant features. However, a linear model with just several parameters normally is not demanding in terms of training data. In this paper, we propose an elegant linear model to untangle confounding factors in challenging realistic m...
Knowledge Aided Consistency for Weakly Supervised Phrase Grounding
Given a natural language query, a phrase grounding system aims to localize mentioned objects in an image. In weakly supervised scenario, mapping between image regions (i.e., proposals) and language is not available in the training set. Previous methods address this deficiency by training a grounding system via learning to reconstruct language information contained in input queries from predicte...
DNA-GAN: Learning Disentangled Represen-
Disentangling factors of variation has always been a challenging problem in representation learning. Existing algorithms suffer from many limitations, such as unpredictable disentangling factors, bad quality of generated images from encodings, lack of identity information, etc. In this paper, we proposed a supervised algorithm called DNA-GAN trying to disentangle different attributes of images....
Unsupervised Learning of Disentangled Representations from Video
We present a new model DRNET that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the ti...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2021
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v35i15.17602